So You Want to Build Your Own Data Center
Since the beginning, Railway’s compute has been built on top of Google Cloud Platform. The platform supported Railway's initial journey, but it has caused a multitude of problems that have posed an existential risk to our business. More importantly, building on a hyperscaler prevents us from delivering the best possible platform to our customers.
It directly affected the pricing we could offer (egress fees anyone?), limited the level of service we could deliver, and introduced engineering constraints that restricted the features we could build.
And it's rare that we even understand why things break upstream; despite multi-million-dollar annual spend, we get about as much support from them as you would spending $100.
So in response, we kicked off the Railway Metal project last year. Nine months later we were live with the first site in California, having designed, spec'd, and installed everything from the fiber optic cables in the cage to the various contracts with ISPs. We're lighting up three more data center regions as we speak.
To deliver an “infra-less” cloud experience to our customers, we've had to get good, fast, at building out our own physical infrastructure. That's the topic of today's blogpost.
From kicking off the Railway Metal project in January 2024, it took us five long months to get the first servers plugged in. It took us an additional three months before we felt comfortable letting our users onto the hardware (and an additional few months before we started writing about it here).
The first step was finding some space.
When you go “on-prem” in cloud-speak, you need somewhere to put your shiny servers and reliable power to keep them running. You also want enough cooling so they don't melt down.
In general you have three main choices: Greenfield buildout (buying or leasing a datacenter), Cage Colocation (getting a private space inside a provider's datacenter enclosed by mesh walls), or Rack colocation (leasing individual racks or partitions of racks in a colocation datacenter).
We chose the second option: a cage to give us four walls, a secure door, and a blank slate for everything else.
A cage before any racks have been fitted
The space itself doesn’t cost much; power (and by proxy, cooling) is the dominant cost. Depending on the geography, the $/kW rate can vary hugely — on the US west coast, for example, we may pay less than half as much as we pay in Singapore. Power is paid for as a fixed monthly commit, whether it's consumed or not, to guarantee it will be available on demand.
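The commit model makes the math simple, if unforgiving: you pay for the full commit regardless of draw. A minimal sketch, with entirely hypothetical $/kW rates (real pricing depends on the market and your contract):

```python
# Hypothetical $/kW-month rates; actual figures vary by market and contract.
RATE_PER_KW_MONTH = {"us-west": 150.0, "singapore": 320.0}

def monthly_power_cost(committed_kw: float, region: str) -> float:
    """Power is billed on the full commit, consumed or not."""
    return committed_kw * RATE_PER_KW_MONTH[region]

# A 100 kW commit at 60% average utilization still bills for all 100 kW.
cost = monthly_power_cost(100, "us-west")  # $15,000/month
```

Note that utilization never enters the formula — which is why right-sizing the commit matters so much.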
But how much power do you need?
Ideally if you’ve embarked on your data center migration mission, you should have an idea of the rough amount of compute you want to deploy. We started with a target number of vCPUs, GBs of RAM, and TBs of NVMe to match our capacity on GCP.
Using these figures, we converged on a server and CPU choice. There are many knobs to turn when doing this computation — probably worth a blogpost in itself — but the single biggest factor for us was power density, i.e. how do we get the compute density we want within a specific power envelope.
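The core of that computation can be sketched in a few lines. The target and server spec below are made-up placeholder numbers, not Railway's actual fleet; the shape of the calculation is what matters:

```python
import math

# Hypothetical capacity target and candidate server SKU.
target = {"vcpus": 50_000, "ram_gb": 200_000, "nvme_tb": 4_000}
server = {"vcpus": 256, "ram_gb": 1_024, "nvme_tb": 30, "watts": 1_100}

# Servers needed to satisfy every dimension of the target.
servers = max(math.ceil(target[k] / server[k]) for k in ("vcpus", "ram_gb", "nvme_tb"))

# How many servers fit inside a rack's power budget?
rack_budget_kw = 17.0
servers_per_rack = int(rack_budget_kw * 1000 // server["watts"])
racks = math.ceil(servers / servers_per_rack)
```

With these numbers the fleet is vCPU- and RAM-bound rather than storage-bound, and the rack count falls out of the power budget, not the physical rack units — which is exactly the power-density trade-off described above.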
The calculations aren’t as simple as summing watts though, especially with 3-phase feeds — Cloudflare has a great blogpost covering this topic.
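For a balanced 3-phase feed, the usable power follows P = √3 × V_LL × I × PF, and continuous loads are conventionally derated to 80% of the breaker rating. A sketch with illustrative feed values:

```python
import math

def three_phase_kw(line_voltage: float, current_a: float, power_factor: float = 1.0) -> float:
    """Power of a balanced 3-phase feed: P = sqrt(3) * V_LL * I * PF, in kW."""
    return math.sqrt(3) * line_voltage * current_a * power_factor / 1000

# Example: a 415 V / 32 A feed, derated to 80% for continuous load.
usable_kw = three_phase_kw(415, 32 * 0.8)  # ~18.4 kW
```

This is why "summing watts" understates the problem: phase balance, power factor, and derating all shave the nameplate number down.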
Power is the most critical resource for data centers, and a power outage can have extremely long recovery times. So redundancy is critical, and it’s important to have two fully independent power feeds per rack. Both feeds will share load under normal operation, but the design must be resilient to a feed going down.
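The sizing consequence of dual feeds is that the surviving feed must carry the entire rack on its own. A minimal check, using the same 80% continuous-load derating as above (feed capacity values are illustrative):

```python
def survives_feed_loss(total_load_kw: float, feed_capacity_kw: float, derate: float = 0.8) -> bool:
    """With two feeds sharing load, a single feed must be able to carry
    the whole rack within its continuous (derated) rating."""
    return total_load_kw <= feed_capacity_kw * derate

# e.g. an 18.4 kW feed can absorb a 14 kW rack on its own, but not 16 kW.
ok = survives_feed_loss(14.0, 18.4)       # True
overloaded = survives_feed_loss(16.0, 18.4)  # False
```

In practice this means each feed runs well under half its rating during normal shared operation.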
To deliver this power to your servers, you’ll also want a Power Distribution Unit, which you'll select based on the number of sockets and management features it provides. The basic ones are glorified extension cords, while the ones we deploy allow control and metering of individual sockets.